1 Bib Clustering Analysis

1.1 Background

Bib numbers uniquely identify each runner before, during, and after the race; during the race itself, the bib is physically worn by the runner. In some races, like the Boston Marathon, bib numbers are handed out in batches and used to organize the waves in which runners start. To make the start of a 26,000-person race more orderly, the 2017 Boston Marathon broke the runners into four color-coded groups. To determine which group (or wave) a runner would be in, the marathon organizers used previously submitted qualifying times, as detailed below. (http://registration.baa.org/2017/cf/Public/iframe_EntryLists.cfm)

  • Red bibs (numbers 101 to 7,700) are assigned to Wave 1 (starting at 10:00 a.m.).
  • White bibs (numbers 8,000 to 15,600) are assigned to Wave 2 (starting at 10:25 a.m.).
  • Blue bibs (numbers 16,000 to 23,600) are assigned to Wave 3 (starting at 10:50 a.m.).
  • Yellow bibs (numbers 24,000 to 32,500) are assigned to Wave 4 (starting at 11:15 a.m.).

The qualifying-time break between Wave 1 and Wave 2 is 3:10:43, between Wave 2 and Wave 3 is 3:29:27, and between Wave 3 and Wave 4 is 3:57:18.

The question at hand is whether we can develop an unsupervised clustering model that accurately identifies these groupings without using the information in the paragraph above. A second question is whether we can confirm that the fourth group also includes runners who did not have to qualify for the marathon but are instead running for a charity group.

1.2 Data Cleaning and Exploration

We first need to convert the bib number from a factor to an integer. We have to convert the factor to a character first, because directly converting a factor to an integer returns the underlying factor level, not the integer the factor may represent.
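As a minimal sketch of that conversion (using a small stand-in factor; the report's actual column lives in the bm_2017 data frame):

```r
f <- factor(c("101", "205", "7700"))       # bib numbers stored as a factor

as.integer(f)                              # wrong: returns the factor levels 1 2 3
bib_int <- as.integer(as.character(f))     # right: returns 101 205 7700
```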

Next, we can plot finishing time against bib number and start to see several trends. First, finishing times increase slowly but steadily with bib number, supporting the theory that faster finishers get lower bib numbers. Second, there are about four observable clusters, which match the waves organized by the Boston Marathon at the start. Finally, the last group has much more variance within it, and far slower average finishing times. These are likely the bib numbers of charity runners and other runners who did not need to qualify for the race.

## Warning: NAs introduced by coercion

This plot is rather dense, so let's use a density scatterplot to better see the distribution of the data.

From the histogram below, we can see that there are some gaps from "unused" bib numbers or runners who did not finish the race. The gaps of unused bib numbers correspond to the breaks between the wave groups, which has the added advantage of sharpening the edges of our clusters and hopefully making it easier for our model to identify the correct groupings.

Now let's label the data with the correct group names so we can compare our model's output.
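A minimal sketch of that labeling, assuming the bib numbers are already integers (the cut points follow the published wave ranges above):

```r
wave_labels <- c("Wave 1 (Red)", "Wave 2 (White)", "Wave 3 (Blue)", "Wave 4 (Yellow)")

# Left-closed intervals: [101, 8000), [8000, 16000), [16000, 24000), [24000, 32501)
assign_wave <- function(bib) {
  cut(bib, breaks = c(101, 8000, 16000, 24000, 32501),
      labels = wave_labels, right = FALSE)
}

assign_wave(c(500, 12000, 20000, 30000))   # one bib from each wave
```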


1.3 K-Means Clustering

We can try K-means clustering to see if the algorithm can identify the known clusters. Since there are four known clusters, we will provide '4' as a parameter to the K-means algorithm. Additionally, since K-means starts from a random division of elements, we will set the random seed to one and run K-means from 20 random starts, keeping the best-fitting model (the one with the lowest total within-cluster sum of squares).
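A minimal sketch of that call, using synthetic bib numbers in place of the real bm_2017 column (the nstart = 20 argument is what re-runs K-means from 20 random starts and keeps the best fit):

```r
set.seed(1)
# Synthetic stand-in for the four bib-number waves
bibs <- c(runif(100, 101, 7700),    runif(100, 8000, 15600),
          runif(100, 16000, 23600), runif(100, 24000, 32500))

km <- kmeans(bibs, centers = 4, nstart = 20)
table(km$cluster)   # cluster sizes
```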

As you can see in the graph above, K-means identifies all four clusters and their break points perfectly. The model works well here in part because of the gaps between the four clusters.

1.4 Hierarchical Clustering

Because of the memory required to run hierarchical clustering on over 26,000 observations, this processing has been moved to a separate .Rmd file.

1.5 Conclusions

From our analysis so far, K-means clustering appears to be the best model for this problem set. Because it is a top-down approach that fits the data into the number of clusters provided as an input, it does an effective job of finding the break points in this data. The hierarchical approach, on the other hand, is less effective at identifying these break points because each successive level of hierarchical clustering looks for the next nearest node rather than taking the entire dataset into account.

2 Rank, Gender and Age Analysis

2.1 Linear Regression

2.1.1 Fourth Quarter Performance

Because we know from our previous analysis that the fourth quarter rank contributes the most to the final rank, we modeled how a runner's previous performance, represented by their bib number (which, as the clustering analysis above showed, encodes qualifying time), their age, and their gender affect their fourth quarter rank.

## 
## Call:
## lm(formula = X30.42K.Rank ~ Bib_int + M.F + Age, data = bm_2017)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22169.7  -4411.6   -505.7   3910.6  21603.1 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.663e+03  1.435e+02   11.59   <2e-16 ***
## Bib_int     5.572e-01  4.417e-03  126.15   <2e-16 ***
## M.FM        1.418e+03  8.076e+01   17.56   <2e-16 ***
## Age         4.557e+01  3.338e+00   13.65   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5728 on 26360 degrees of freedom
## Multiple R-squared:  0.4336, Adjusted R-squared:  0.4335 
## F-statistic:  6727 on 3 and 26360 DF,  p-value: < 2.2e-16
##  Bib_int     M.FM      Age 
## 1.300902 1.298240 1.167188

In a model predicting fourth quarter rank from bib number, gender, and age, the intercept is 1.66e+03 and is significant (p<.001). All predictors (bib number, gender, and age) significantly predict fourth quarter rank (p<0.001). A one-unit increase in bib number increases fourth quarter rank by +0.557 (p<0.001), meaning a higher bib number (slower past performance) predicts a higher fourth quarter rank (slower fourth quarter performance). Being male (vs. female) increases fourth quarter rank by +1,420 (p<0.001). A one-unit increase in age increases fourth quarter rank by +45.6 (p<0.001), meaning older runners are likely to have a higher fourth quarter rank (slower fourth quarter performance). Adjusted R2 is .434, which means that 43.4% of the variability in fourth quarter rank is explained by bib number, gender, and age. The VIFs are 1.17-1.30, indicating low multicollinearity.

2.1.2 Overall Performance

We also examined rank in each quarter of the race, age, and gender to see their effects on final official time and on overall rank.

## 
## Call:
## lm(formula = Official.Time.Min ~ X0.10K.Rank + X10.20K.Rank + 
##     X20.30K.Rank + X30.42K.Rank + M.F + Age, data = bm_2017)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -106.555   -7.703   -1.614    5.424  163.943 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.671e+02  2.880e-01  580.19   <2e-16 ***
## X0.10K.Rank   1.168e-03  3.054e-05   38.23   <2e-16 ***
## X10.20K.Rank  1.118e-03  4.596e-05   24.32   <2e-16 ***
## X20.30K.Rank  1.256e-03  3.983e-05   31.52   <2e-16 ***
## X30.42K.Rank  2.217e-03  2.222e-05   99.79   <2e-16 ***
## M.FM          3.369e+00  1.696e-01   19.86   <2e-16 ***
## Age          -1.568e-01  6.934e-03  -22.61   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.61 on 26357 degrees of freedom
## Multiple R-squared:  0.9238, Adjusted R-squared:  0.9238 
## F-statistic: 5.329e+04 on 6 and 26357 DF,  p-value: < 2.2e-16
##  X0.10K.Rank X10.20K.Rank X20.30K.Rank X30.42K.Rank         M.FM 
##    10.564164    23.925850    17.968686     5.592565     1.393928 
##          Age 
##     1.225789
## 
## Call:
## lm(formula = Overall ~ X0.10K.Rank + X10.20K.Rank + X20.30K.Rank + 
##     X30.42K.Rank + M.F + Age, data = bm_2017)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21645.5   -215.4    -53.2    172.4  14471.4 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -6.086e+02  1.851e+01 -32.874  < 2e-16 ***
## X0.10K.Rank   1.859e-01  1.963e-03  94.720  < 2e-16 ***
## X10.20K.Rank  1.745e-01  2.954e-03  59.049  < 2e-16 ***
## X20.30K.Rank  2.902e-01  2.560e-03 113.348  < 2e-16 ***
## X30.42K.Rank  4.014e-01  1.428e-03 281.027  < 2e-16 ***
## M.FM         -3.653e+01  1.090e+01  -3.350 0.000808 ***
## Age          -2.702e-01  4.457e-01  -0.606 0.544320    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 746.4 on 26357 degrees of freedom
## Multiple R-squared:  0.9904, Adjusted R-squared:  0.9904 
## F-statistic: 4.527e+05 on 6 and 26357 DF,  p-value: < 2.2e-16
##  X0.10K.Rank X10.20K.Rank X20.30K.Rank X30.42K.Rank         M.FM 
##    10.564164    23.925850    17.968686     5.592565     1.393928 
##          Age 
##     1.225789

Rank in each quarter of the race, gender, and age all have a significant effect on official finishing time. However, the VIFs for rank in 0-10k, 10-20k, and 20-30k show that the independent variables are highly correlated. Given our previous finding that fourth quarter performance contributes the most to the final rank, we rebuilt the model after dropping the first three quarters' ranks.

## 
## Call:
## lm(formula = Official.Time.Min ~ X30.42K.Rank + M.F + Age, data = bm_2017)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -124.170   -9.890    0.074    8.129  180.538 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.727e+02  4.250e-01  406.34   <2e-16 ***
## X30.42K.Rank  4.833e-03  1.465e-05  329.89   <2e-16 ***
## M.FM         -1.070e+01  2.229e-01  -48.01   <2e-16 ***
## Age           1.792e-01  9.881e-03   18.13   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.25 on 26360 degrees of freedom
## Multiple R-squared:  0.8318, Adjusted R-squared:  0.8318 
## F-statistic: 4.346e+04 on 3 and 26360 DF,  p-value: < 2.2e-16
## X30.42K.Rank         M.FM          Age 
##     1.100936     1.090192     1.127243

In a model predicting official time from fourth quarter rank, gender, and age, the intercept is 173 and is significant (p<.001). All predictors (fourth quarter rank, gender, and age) significantly predict official time (p<0.001). Adjusted R2 is .832, which means that 83.2% of the variability in official time is explained by fourth quarter rank, gender, and age. The VIFs are 1.09-1.13, indicating low multicollinearity.

A one-unit increase in fourth quarter rank increases official time by +0.0048 minutes. Being male (vs. female) decreases official time by 10.70 minutes. A one-unit increase in age increases official time by 0.1792 minutes.

## 
## Call:
## lm(formula = Overall ~ X30.42K.Rank + M.F + Age, data = bm_2017)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23503.5  -1251.2    118.5   1394.2  18545.0 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.733e+02  5.884e+01   6.345 2.27e-10 ***
## X30.42K.Rank  8.916e-01  2.028e-03 439.646  < 2e-16 ***
## M.FM         -2.521e+03  3.086e+01 -81.699  < 2e-16 ***
## Age           5.824e+01  1.368e+00  42.578  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2389 on 26360 degrees of freedom
## Multiple R-squared:  0.9016, Adjusted R-squared:  0.9016 
## F-statistic: 8.049e+04 on 3 and 26360 DF,  p-value: < 2.2e-16
## X30.42K.Rank         M.FM          Age 
##     1.100936     1.090192     1.127243

In the improved model predicting overall rank, all p-values are now below 0.001, so every variable is significant. The VIFs are between 1.09 and 1.13, indicating low multicollinearity. Adjusted R2 is .9016, which means that 90.16% of the variability in overall rank is explained by fourth quarter rank, gender, and age. The model has therefore been improved in an appropriate way, and fourth quarter rank, gender, and age all contribute to overall rank.

A one-unit increase in fourth quarter rank increases overall rank by +0.8916 places. Being male (vs. female) decreases overall rank by 2,521 places. A one-unit increase in age increases overall rank by 58.24 places.

2.2 Official Time Distribution among Different Groups

From the following two plots, we can see that the distributions of overall rank and official time are related to fourth quarter rank, gender, and age.

2.3 KNN Analysis

Next, we explored whether we could predict participants' ranks in four categories from their age, gender, and fourth quarter rank. We classified official time into four levels by quartile of official time.

Then we ran the KNN test.
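A sketch of that test using the class package's knn() on synthetic stand-ins for the real features (the variable names here are hypothetical):

```r
library(class)   # provides knn(); ships with standard R distributions

set.seed(1)
n <- 1000
rank4 <- sample.int(26000, n)             # fourth quarter rank
age   <- sample(18:80, n, replace = TRUE)
male  <- sample(0:1, n, replace = TRUE)
level <- cut(rank4, 4, labels = 1:4)      # four outcome categories

X <- scale(cbind(rank4, age, male))       # scale features for distance-based KNN
train_idx <- sample(n, 0.7 * n)
pred <- knn(train = X[train_idx, ], test = X[-train_idx, ],
            cl = level[train_idx], k = 12)
mean(pred == level[-train_idx])           # test-set accuracy
```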

It seems 12-nearest neighbors (with an accuracy rate of 72.5%) is a decent choice, because that is where the incremental improvement in predictive accuracy trails off. So we can predict a participant's level from their age, fourth quarter rank, and gender. The cross table of actual vs. predicted values follows:

3 Split PCA Analysis

3.1 Linear Regression 1

Let us predict the Official time of the marathon by using all the predictor variables in the data.

## 
## Call:
## lm(formula = Official.Time ~ Age + M.F + X5K + X10K + X15K + 
##     X20K + Half + X25K + X30K + X35K + X40K, data = splits)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.307  -0.601  -0.154   0.389 121.528 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.891228   0.097179   9.171  < 2e-16 ***
## Age          0.014733   0.001206  12.215  < 2e-16 ***
## M.FM         0.208143   0.028784   7.231 4.92e-13 ***
## X5K         -0.025793   0.026224  -0.984 0.325347    
## X10K         0.024532   0.027701   0.886 0.375839    
## X15K         0.023922   0.018976   1.261 0.207449    
## X20K        -0.165660   0.039651  -4.178 2.95e-05 ***
## Half         0.168417   0.039072   4.310 1.64e-05 ***
## X25K        -0.043706   0.011515  -3.796 0.000148 ***
## X30K        -0.008966   0.008662  -1.035 0.300603    
## X35K        -0.216029   0.006934 -31.154  < 2e-16 ***
## X40K         1.253935   0.003514 356.793  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.042 on 26200 degrees of freedom
## Multiple R-squared:  0.9976, Adjusted R-squared:  0.9976 
## F-statistic: 1.007e+06 on 11 and 26200 DF,  p-value: < 2.2e-16
##         Age        M.FM         X5K        X10K        X15K        X20K 
##    1.193197    1.290768   68.169149  309.461352  343.112512 2931.789934 
##        Half        X25K        X30K        X35K        X40K 
## 3187.889602  425.862092  379.551362  356.622865  123.269791

In the model predicting the official time of the marathon using age, gender, and all the split times, we can see that not all of the coefficients are significant; only the intercept, age, gender, and some of the split times are. Adjusted R2 is .99, which is almost 1. This is because we have used all the split times up to 40K, so the model explains 99% of the variability in official time using the split times, gender, and age. But the main problem is multicollinearity in the data: the VIF values of the split times are well above 10, which suggests a great deal of multicollinearity. We will overcome this by using PCA and PCR. Before that, let's predict the official time of the marathon using age, gender, and the half-marathon split.

3.2 Linear Regression 2

## 
## Call:
## lm(formula = Official.Time ~ Age + M.F + Half, data = splits)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -129.710   -8.495   -2.382    5.639  192.668 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -11.599737   0.587650 -19.739   <2e-16 ***
## Age          -0.015724   0.007779  -2.021   0.0432 *  
## M.FM          5.965982   0.183971  32.429   <2e-16 ***
## Half          2.235712   0.005038 443.790   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.37 on 26208 degrees of freedom
## Multiple R-squared:  0.8988, Adjusted R-squared:  0.8988 
## F-statistic: 7.758e+04 on 3 and 26208 DF,  p-value: < 2.2e-16
##      Age     M.FM     Half 
## 1.157320 1.229639 1.235855

In this model, we predict the official time of the marathon from age, gender, and the half-marathon split. All the coefficients are statistically significant. Adjusted R2 is .8988, which means the model explains about 90% of the variability in official time with age, gender, and half time. The VIF values (1.16, 1.23, 1.24) indicate no serious multicollinearity.

3.3 PCA and PCR

Let us see whether we can apply Principal Component Analysis and Regression to our split times.

Principal component analysis of the split times suggests that the first principal component, PC1, covers 97% of the variance in the data; PC1 and PC2 together cover 99%. We can observe this in the plot above. Let us now perform regression based on the principal components, i.e., principal component regression (PCR). We can validate R2 based on the number of components used.
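A minimal prcomp() sketch of the PCA step, on synthetic split times driven by a single underlying pace (the real analysis uses the X5K through X40K columns of the splits data frame):

```r
set.seed(1)
pace <- runif(500, 3, 6)   # underlying min/km pace for each runner

# Nine cumulative "split times" sharing one driver, plus noise
splits_mat <- sapply(1:9, function(i) i * 5 * pace + rnorm(500, sd = 2))

pc <- prcomp(splits_mat, scale. = TRUE)
# Cumulative proportion of variance explained by PC1 and PC1+PC2
summary(pc)$importance["Cumulative Proportion", 1:2]
```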

## Data:    X dimension: 26212 9 
##  Y dimension: 26212 1
## Fit method: svdpc
## Number of components considered: 9
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           42.02    11.58    5.578    3.384    2.701    2.322    2.104
## adjCV        42.02    11.58    5.578    3.384    2.701    2.321    2.103
##        7 comps  8 comps  9 comps
## CV       2.100    2.057    2.056
## adjCV    2.099    2.056    2.056
## 
## TRAINING: % variance explained
##                1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## X                97.00    99.32    99.78    99.89    99.94    99.96
## Official.Time    92.41    98.24    99.35    99.59    99.70    99.75
##                7 comps  8 comps  9 comps
## X                99.98   100.00   100.00
## Official.Time    99.75    99.76    99.76

In the principal component regression, 1 component covers 97% of the variance in the split times and 92% of the variance in official time, whereas 2 components cover 99% of the variance in the split times and 98% of the variance in official time. The R2 value is close to .92 with 1 component and close to .98 with 2 components; i.e., 2 principal components cover 98% of the variance in official time.

4 Analysis using different models

4.1 Preprocessing and Exploration

Let's drop the columns that are not required and handle the missing values in the dataset.

There are 26259 rows after dropping the columns and removing NA and blank values.

Let's calculate each runner's pace between each time split. Pace is time taken per kilometer: a pace of 5' 20" (in metric) means you take 5 minutes and 20 seconds to run each kilometer, and lower values mean you are traveling faster. For the 2017 Boston Marathon, the total distance a runner must cover is 42.195 km, so at the half split a runner has covered 21.1 km.
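The pace computation between two cumulative splits can be sketched as follows (split times in minutes; the function name and example values are illustrative):

```r
# Pace (min/km) over a segment = elapsed time / segment distance
pace_between <- function(t_start_min, t_end_min, dist_km) {
  (t_end_min - t_start_min) / dist_km
}

pace_between(0, 16, 5)    # 0k-5k segment:  3.2 min/km
pace_between(16, 33, 5)   # 5k-10k segment: 3.4 min/km
```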

4.1.1 Extracting features and Targets

Let's subset the age and pace columns up to the half split, which will be the features for our models; the target we will predict is the official time.

4.1.2 Train and Test splits

Let's split the dataset into a 70% train set and a 30% test set, setting the seed to 1 so the split is reproducible.
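A sketch of the split, on a stand-in data frame df (the real one holds the age and pace features plus official time):

```r
set.seed(1)   # reproducible split
df <- data.frame(x = rnorm(100), y = rnorm(100))   # stand-in data

train_idx <- sample(seq_len(nrow(df)), size = floor(0.7 * nrow(df)))
train <- df[train_idx, ]
test  <- df[-train_idx, ]
c(nrow(train), nrow(test))   # 70 and 30 rows
```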

4.1.3 Correlation Matrix

Let’s plot correlation matrix of training data.

We can observe that there is multicollinearity between the pace variables, and also that the pace variables are strongly correlated with official time.

4.1.4 Standardizing the Train and Test Data

In contrast to ordinary least squares regression, ridge regression is highly affected by the scale of the predictors. Therefore, it is better to standardize (i.e., scale) the predictors before applying ridge and lasso regression, so that all the predictors are on the same scale. So let's scale X_train, X_test, y_train, and y_test.
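One detail worth making explicit: the test set should be scaled with the training set's means and standard deviations, not its own, to avoid leaking test information. A sketch with stand-in matrices:

```r
set.seed(1)
X_train <- matrix(rnorm(200, mean = 5, sd = 3), ncol = 2)   # stand-in predictors
X_test  <- matrix(rnorm(60,  mean = 5, sd = 3), ncol = 2)

X_train_s <- scale(X_train)
X_test_s  <- scale(X_test,
                   center = attr(X_train_s, "scaled:center"),
                   scale  = attr(X_train_s, "scaled:scale"))
colMeans(X_train_s)   # ~0 for each scaled training column
```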

4.2 Models

4.2.1 Ordinary Least Squares (OLS)

Let’s build an Ordinary Least Squares model using our train data and predict official time with test data.

## 
## Call:
## lm(formula = Official.Time.Min ~ ., data = train1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -204.932   -7.252   -2.160    4.780  120.507 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.796316   0.676113  -2.657  0.00789 ** 
## Age           0.022852   0.008584   2.662  0.00777 ** 
## M.FM          4.576376   0.200729  22.799  < 2e-16 ***
## Pace0k.5k     6.881673   0.410370  16.769  < 2e-16 ***
## Pace5k.10k   -0.533937   0.594195  -0.899  0.36888    
## Pace10k.15k  11.597146   0.493059  23.521  < 2e-16 ***
## Pace15k.20k  14.046695   0.278331  50.468  < 2e-16 ***
## Pace20k.Half 12.522291   0.255354  49.039  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.2 on 18373 degrees of freedom
## Multiple R-squared:  0.9169, Adjusted R-squared:  0.9168 
## F-statistic: 2.894e+04 on 7 and 18373 DF,  p-value: < 2.2e-16
##          Age         M.FM    Pace0k.5k   Pace5k.10k  Pace10k.15k 
##     1.182851     1.232578    13.203802    29.891158    23.848568 
##  Pace15k.20k Pace20k.Half 
##    10.475560     7.854640
##          R2     RMSE      MAE
## 1 0.9179003 11.94196 8.264375

About 91.8% of the variability in official time is explained by the predictor variables. Pace between 15k and 20k and pace between 20k and 21.1k (the half split) are the top two features in determining official time. We can also observe from the VIF values that there is multicollinearity between the independent variables, so let's build a model that helps solve the multicollinearity problem.

4.2.2 Ridge Regression

Ridge regression shrinks the coefficients of the independent variables to prevent multicollinearity. We need to calculate the regularization parameter lambda that adjusts the amount of coefficient shrinkage. The best lambda for the data, can be defined as the lambda that minimize the cross-validation prediction error rate. This can be determined using the function cv.glmnet().
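A sketch of the cv.glmnet() call on stand-in data (this assumes the glmnet package; alpha = 0 selects ridge, while alpha = 1 would select the lasso used later):

```r
library(glmnet)

set.seed(1)
# Stand-in data; the report uses the scaled pace features and official time
X <- matrix(rnorm(500 * 5), ncol = 5)
y <- X %*% c(1, 2, 0, 0, 3) + rnorm(500)

cv_fit <- cv.glmnet(X, y, alpha = 0)    # 10-fold CV over a lambda path
best_lambda <- cv_fit$lambda.min        # lambda with the lowest CV error
pred <- predict(cv_fit, newx = X, s = "lambda.min")
```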

The first vertical dotted line marks the lambda with the lowest MSE; the second marks the largest lambda within one standard error of it. The labels above the graph show how many non-zero coefficients are in the model. The best lambda value found here is 0.0934576. Let's predict the official time for the test data using the best lambda value.

## [1] 0.08550857
##         MAE      RMSE        X1
## 1 0.2064382 0.2924185 0.9122858
##          Age         M.FM    Pace0k.5k   Pace5k.10k  Pace10k.15k 
##   0.03490199   3.84633811   6.00007979   5.85997084   9.56233432 
##  Pace15k.20k Pace20k.Half 
##  11.56372032  11.26372657

We got an R-squared value of about 91.2% using ridge regression.

4.2.3 The Lasso

Lasso regression also reduces multicollinearity between variables, by shrinking some coefficients all the way to zero. The same function, glmnet(), with alpha set to 1, builds the lasso regression model.

## Warning in regularize.values(x, y, ties, missing(ties)): collapsing to
## unique 'x' values

Here, we see that the lowest MSE occurs when \(\lambda \approx 0.00465\); that model has 5 non-zero coefficients.

## [1] 0.08211911
##         MAE      RMSE        X1
## 1 0.2064382 0.2924185 0.9122858
##          Age         M.FM    Pace0k.5k   Pace5k.10k  Pace10k.15k 
##    0.0214768    4.5631016    6.6009531    0.0000000   11.3663131 
##  Pace15k.20k Pace20k.Half 
##   14.0402185   12.5053273
##          Age         M.FM    Pace0k.5k  Pace10k.15k  Pace15k.20k 
##    0.0214768    4.5631016    6.6009531   11.3663131   14.0402185 
## Pace20k.Half 
##   12.5053273

The main problem with lasso regression is that when we have correlated variables, it retains only one and sets the other correlated variables to zero. That can lead to some loss of information, resulting in lower accuracy in our model.

4.2.4 Decision tree

## 
## Regression tree:
## rpart(formula = Official.Time.Min ~ ., data = train1, method = "anova")
## 
## Variables actually used in tree construction:
## [1] Pace10k.15k  Pace15k.20k  Pace20k.Half
## 
## Root node error: 32913426/18381 = 1790.6
## 
## n= 18381 
## 
##         CP nsplit rel error  xerror      xstd
## 1 0.595813      0   1.00000 1.00008 0.0111647
## 2 0.128601      1   0.40419 0.40571 0.0046479
## 3 0.096503      2   0.27559 0.28004 0.0041649
## 4 0.022903      3   0.17908 0.18117 0.0028424
## 5 0.018829      4   0.15618 0.16345 0.0027726
## 6 0.015728      5   0.13735 0.14101 0.0026243
## 7 0.011889      6   0.12162 0.12213 0.0024654
## 8 0.010000      7   0.10973 0.11272 0.0023320

##          R2     RMSE      MAE
## 1 0.8863439 14.05541 10.19628

We got an R-squared value of about 88.6% using the decision tree. The main disadvantage of a decision tree is that, with sufficient depth, we get low-bias but high-variance predictions. High variance means the decision tree changes a lot with changes in the training data, and its accuracy changes with it. To control the variance, we use a technique called bagging.

4.2.5 Bagging Tree

In bagging, we build an ensemble of low-bias, high-variance base models; averaging over the ensemble yields low-bias, low-variance predictions. Below we build an ensemble of decision trees using the treebag method.

##          R2     RMSE      MAE
## 1 0.9005626 13.14366 9.290779

We got an R-squared value of about 90.1% using the bagging tree. The main disadvantage of bagging is that the predictions from the individual decision trees are highly correlated. To reduce the correlation between trees, we use a random forest.

4.2.6 Random Forest

Random forest is an extension of bagging in which we also subsample the features, along with bootstrapping the rows with replacement.

##                 Length Class  Mode     
## call                4  -none- call     
## type                1  -none- character
## predicted       18381  -none- numeric  
## mse               100  -none- numeric  
## rsq               100  -none- numeric  
## oob.times       18381  -none- numeric  
## importance          7  -none- numeric  
## importanceSD        0  -none- NULL     
## localImportance     0  -none- NULL     
## proximity           0  -none- NULL     
## ntree               1  -none- numeric  
## mtry                1  -none- numeric  
## forest             11  -none- list     
## coefs               0  -none- NULL     
## y               18381  -none- numeric  
## test                0  -none- NULL     
## inbag               0  -none- NULL     
## terms               3  terms  call
## 
## Call:
##  randomForest(formula = Official.Time.Min ~ ., data = train1,      ntree = 100) 
##                Type of random forest: regression
##                      Number of trees: 100
## No. of variables tried at each split: 2
## 
##           Mean of squared residuals: 127.8989
##                     % Var explained: 92.86

##          R2     RMSE      MAE
## 1 0.9005626 13.14366 9.290779

We got an R-squared value of about 90.1% using random forest.

4.2.7 Gradient Boosting

Boosting is another approach for improving the predictions resulting from Decision Tree. In Gradient Boosting, we build ensemble of decision trees sequentially and the predictions of individual trees are summed sequentially. Every decision tree tries to recover the loss (difference between actual and predicted values) by fitting the tree on residuals of the previous tree. This results in model with better accuracy.

We got an R-squared value of about 91.5% using gradient boosting. We can see that pace between 15k and 20k and pace between 20k and 21.1k are by far the most important variables in our gbm model.

4.2.8 xgboost

Extreme Gradient Boosting or xgboost gives better approximations of predictions over gradient boosting and computationally it is fast in training the data.

We got an R-squared value of about 92.5% using xgboost.

4.3 Conclusion

  • Of all the models, xgboost has the best R-squared value, at about 92.5%.
  • Pace between 15k and 20k and pace between 20k and 21.1k are by far the top two variables in all the models. So a runner who holds a better pace between 15k and 21.1k is likely to post a better finishing time.

5 Distance Traveled Analysis

5.1 Background

In our previous data exploration, we conducted extensive analysis of how a runner's hometown affected finishing time. We found strong differences among home states and home countries, with faster runners coming from African countries like Ethiopia and Kenya and slower runners coming from the New England states near the start of the marathon.

We were curious to see if the distance traveled had a more direct relationship with finishing time. To answer this question, we had to do some feature engineering. In data science, we rarely have all the data we need and frequently must build our own features from aspects of the existing data.

To attack this problem, we leveraged work from a previous class, using Python to pass place names to Google Maps' geocoding API, which returns locational metadata including latitude and longitude. To determine the distance between a runner's provided hometown and the race start, we built a custom function that calculates distance using the haversine formula. All details are included in the attached Jupyter Notebook and HTML files. Once the calculations were complete, the distance traveled for each runner was merged back into our original dataset using bib number as the unique identifier.
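The original distance function was written in Python; an equivalent R sketch of the haversine formula looks like this (Earth radius and output in miles; the Hopkinton coordinates are approximate):

```r
haversine_mi <- function(lat1, lon1, lat2, lon2, r = 3959) {
  to_rad <- function(deg) deg * pi / 180
  dlat <- to_rad(lat2 - lat1)
  dlon <- to_rad(lon2 - lon1)
  a <- sin(dlat / 2)^2 +
       cos(to_rad(lat1)) * cos(to_rad(lat2)) * sin(dlon / 2)^2
  2 * r * asin(sqrt(a))   # great-circle distance in miles
}

# Approximate distance from Hopkinton, MA (race start) to Chicago, IL
haversine_mi(42.23, -71.52, 41.88, -87.63)
```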

5.2 Linear Regression Fitting and Analysis

First we will build a linear model, then use it to predict each runner's finishing time as a function of the distance traveled from their hometown. Once we build the model, we can plot the actual data with the predictions as a line. As you can see from the plot below, there appears to be a very weak negative correlation between distance traveled and finishing time. Let's examine the model output for more details.

We can further analyze the model by looking at specific outputs. The R-squared value of this model is 0.00713. The p-value for the distance term is effectively zero. The coefficient of the distance variable is -0.00203. The VIF is 1.

These outputs confirm what we saw in the plot. For every mile farther away a runner's hometown is, the runner is expected to run 0.00203 minutes faster. However, the model is a very poor fit for the data, with very large residuals.

5.3 Conclusions

This is an excellent example that detailed feature engineering and even a very small p-value do not guarantee a good model. The distance-traveled coefficient is definitely statistically significant, but the model has almost no predictive power: the adjusted R-squared rounds to 0 and the residuals are obviously large.